Introduction to data visualization, web scraping, and text analysis in R: Session 2a

Visualizing data with ggplot2

J.M.T. Roos

Last updated: 2019-06-26 12:51:50

Review

  • In class
    • git — cloning, committing, pushing
    • R Markdown — mixing text with R code using ```{r} blocks
  • This past week and/or during the break
    • Reviewing/learning the R programming language
    • Installing packages

Applied problem: Merging samples

Repeated measures for 11 individuals, mean (sd)

Round   Duration    Number Correct
1       7.5 (2.0)   9.0 (3.3)
2       7.5 (2.0)   9.0 (3.3)
3       7.5 (2.0)   9.0 (3.3)
4       7.5 (2.0)   9.0 (3.3)

Applied problem: Merging samples

Regression of Duration on Number Correct repeated for each round

Round   Term          Estimate   SE
1       (Intercept)   3.0        1.12
        num.correct   0.5        0.12
2       (Intercept)   3.0        1.13
        num.correct   0.5        0.12
3       (Intercept)   3.0        1.12
        num.correct   0.5        0.12
4       (Intercept)   3.0        1.12
        num.correct   0.5        0.12

Remember…

Always look at the data first

ALWAYS. LOOK. AT. THE. DATA.

Today

  • This session: Data visualization with ggplot2
    • Basics
    • In-class exercises
    • More advanced concepts
    • In-class exercises
  • Later: Tidying and summarizing data

ggplot2

  • Plotting package in R intended to replace the core plotting routines
  • Based on the concept of a grammar of graphics
    • Plots are constructed from simpler components, much as sentences are constructed from nouns, verbs, etc.
    • Not all arrangements of words lead to comprehensible sentences — the same is true for plots, and ggplot2 helps you avoid (visual) nonsense
    • This approach leads to a modularity of design, making it easy for programmers to extend
  • Sensible and aesthetically pleasing default settings
    • Informed by what we know about visual perception and cognition

What is a graph?

A visual display that illustrates one or more relationships among numbers…a shorthand means of presenting information that would take many more words and numbers to describe.

—Stephen M. Kosslyn. Graph Design for the Eye and Mind. Oxford University Press, 2006

It depends on the goal:

  • A tool for discovery — gain an overview of, convey the scale and complexity of, or facilitate an exploration of data (dataviz)
  • A tool for communication — help you to help others understand, tell a story about, or stimulate interest in a problem or solution (infographics)

At a minimum…

  • A graph is for comparing quantities
    • Always ask yourself: “What is the main comparison?”
  • A graph should answer a central question
    • Both the question and answer should be clear
  • Both should be obvious to you and the viewer

Psychological principles (Kosslyn, 2006)

  • I will briefly cover some of what we know about human cognition of data visualizations

  • If you want to know more, the book by Kosslyn (see quote earlier) is a good reference

  • The book divides what we know into 8 principles, which I think fall into 3 buckets:
    • Get their attention
    • Hold and direct their attention
    • Help them remember

Get their attention

  1. Relevance
    • Not too much or too little information
    • Present information that reflects the message you want to convey
    • Don’t present extraneous information
  2. Appropriate knowledge
    • Prior knowledge must be sufficient to understand the graph
    • If you assume too much prior knowledge, viewers will be confused
    • If you violate norms, viewers will be confused

If they are confused, they won’t try to understand your graph

Hold and direct their attention

  1. Salience
    • Attention is drawn to large perceptible differences
    • The most visually striking aspect receives the most attention
    • Annotations help direct viewers’ attention
  2. Discriminability
    • Properties must differ enough to be noticed
    • Defaults in ggplot2 do much of this work for you
  3. Organization
    • Groups of elements are seen and remembered as a whole

Try to anticipate the process the audience will go through while looking at your graph

Help them remember

  1. Compatibility
    • Form should be aligned with meaning
    • Lines express continuous change, bars discrete quantities
    • More = more (higher, better, bigger, etc.)
  2. Informative changes
    • Changes in properties should carry information
    • …and vice versa
  3. Capacity limitations
    • If too much information is presented, none is remembered
    • Four chunks in working memory
    • Graph designers err on the side of presenting too much; graph readers err on the side of paying too little attention

Decide what you want them to remember; everything else is secondary to that

ggplot2’s grammar

  • Decomposes graphs into basic parts
  • Sets rules for interactions among those parts
  • Helps us stay out of trouble

ggplot2’s grammar

  • Default values for Data and Mapping available to all layers
  • Layers — one or more, each with the following:
    • Data (overriding the default) — a data.frame
    • Mapping (overriding the default) of columns to Aesthetics
    • Geometry specifying what to draw
    • Statistic specifying how to transform the data before drawing
    • Position specifying how to arrange items
  • Facet specification for generating subplots
  • Scales specifying how to translate the data to lengths, colors, sizes, etc. in the graph
  • Coordinates specifying the coordinate system; the default (Cartesian) is what you want 99% of the time, so ignore this for now
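As a rough sketch of how these parts map onto code, the following spells out every component explicitly for a single plot. It assumes the mpg data set that ships with ggplot2; the facet variable and the explicit scale and coordinate calls are chosen only for illustration, since ggplot2 would fill in those defaults anyway.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # default data and mapping
  layer(                      # one layer, with nothing left implicit
    geom     = "point",       # geometry: what to draw
    stat     = "identity",    # statistic: leave the data unchanged
    position = "identity"     # position: no adjustment
  ) +
  facet_wrap(~ drv) +         # facets: one subplot per value of drv
  scale_y_continuous() +      # scale: data values to positions on the y-axis
  coord_cartesian()           # coordinates: the Cartesian default

# print(p) draws the plot
```

In everyday use the layer() call is written with a shorthand such as geom_point(), and the scale and coordinate lines are simply omitted.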

Layers

  • Layers contain everything we see, often showing different views of the same data

Test data

test_data
## # A tibble: 44 x 4
##    round respondent num.correct duration
##    <fct> <fct>            <dbl>    <dbl>
##  1 1     1                   10     8.04
##  2 1     2                    8     6.95
##  3 1     3                   13     7.58
##  4 1     4                    9     8.81
##  5 1     5                   11     8.33
##  6 1     6                   14     9.96
##  7 1     7                    6     7.24
##  8 1     8                    4     4.26
##  9 1     9                   12    10.8 
## 10 1     10                   7     4.82
## # … with 34 more rows
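The rows printed above coincide with the first set (x1, y1) of R's built-in anscombe data. Assuming that test_data is simply a long-format version of that data set (an assumption, not the author's actual code), something like it can be rebuilt in base R, and the per-round summaries and regressions from the earlier tables can then be recomputed:

```r
# Assumption: test_data is a reshaped copy of datasets::anscombe,
# with x1..x4 as num.correct and y1..y4 as duration.
rounds <- lapply(1:4, function(i) {
  data.frame(
    round       = factor(i, levels = 1:4),
    respondent  = factor(1:11),
    num.correct = anscombe[[paste0("x", i)]],
    duration    = anscombe[[paste0("y", i)]]
  )
})
test_data <- do.call(rbind, rounds)

# Per-round means and standard deviations
round_means <- aggregate(cbind(duration, num.correct) ~ round,
                         data = test_data, FUN = mean)
round_sds   <- aggregate(cbind(duration, num.correct) ~ round,
                         data = test_data, FUN = sd)

# Per-round regression of duration on num.correct
# (one column per round, rows are intercept and slope)
round_coefs <- sapply(split(test_data, test_data$round),
                      function(d) coef(lm(duration ~ num.correct, data = d)))

round_means
round_sds
round_coefs
```

The summaries are nearly identical across rounds, which is exactly why the plots in the following slides matter.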

Defaults

  • Specify the defaults first
  • Most graphs use a single set of data (data.frame) for every layer
  • Most graphs use a single set of mappings between columns and aesthetics
my_plot <- ggplot(data = test_data, mapping = aes(x = duration,
    y = num.correct))
  • aes() is used to create a list of aesthetic mappings
    • x refers to the graph’s x-axis, y to the y-axis
    • duration \(\rightarrow\) x-axis
    • num.correct \(\rightarrow\) y-axis
  • my_plot now represents a ggplot object set to our defaults
  • You don’t need to name the arguments; data comes first, mapping comes second
my_plot <- ggplot(test_data, aes(x = duration, y = num.correct))

An empty plot

  • Defaults by themselves do nothing
print(my_plot)

  • By default, we get an “empty” plot
  • To see something, we need to specify a layer

Adding a layer

  • Use the + operator to combine ggplot elements
my_plot + geom_point()

  • Usually you do not need the print() call, so the following two lines are equivalent:
    my_plot + geom_point()
    print(my_plot + geom_point())

Each layer has a geometry

my_plot + geom_point()
my_plot + geom_line()

my_plot + geom_point() + geom_line()

Each layer has a statistic

  • Usually the statistic is the identity function, \(f(x)=x\); that is, the data are left unchanged
  • The default statistic for geom_point and geom_line is identity so these plots show the data as is
  • The default statistic for geom_histogram is a binning function (called stat_bin)
ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 2)

Result of applying binning function to duration

## # A tibble: 44 x 4
##    round respondent num.correct duration
##    <fct> <fct>            <dbl>    <dbl>
##  1 1     1                   10     8.04
##  2 1     2                    8     6.95
##  3 1     3                   13     7.58
##  4 1     4                    9     8.81
##  5 1     5                   11     8.33
##  6 1     6                   14     9.96
##  7 1     7                    6     7.24
##  8 1     8                    4     4.26
##  9 1     9                   12    10.8 
## 10 1     10                   7     4.82
## # … with 34 more rows
## # A tibble: 5 x 2
##       x     y
##   <dbl> <dbl>
## 1     4     4
## 2     6    13
## 3     8    20
## 4    10     5
## 5    12     2
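One way to inspect what a statistic hands to its geometry is ggplot2's layer_data() function, which returns a layer's data after the statistic has been applied. The sketch below assumes, as a stand-in for test_data, that the duration values are the y1 to y4 columns of R's built-in anscombe data.

```r
library(ggplot2)

# Assumption: duration values taken from datasets::anscombe (y1..y4)
test_data <- data.frame(
  duration = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

p <- ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 2)

# Data for layer 1 after stat_bin has run: one row per bin
bins <- layer_data(p, 1)
bins[, c("x", "count", "xmin", "xmax")]
```

Here x is the bin centre and count is the bar height, corresponding to the x and y columns in the small table above.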

Geoms and statistics

  • Each geom/statistic has a default statistic/geom
Item             Default stat/geom
geom_point       stat_identity (\(f(x)=x\))
geom_line        stat_identity (\(f(x)=x\))
geom_histogram   stat_bin (binning)
geom_smooth      stat_smooth (regression)
stat_smooth      geom_smooth (line + ribbon)
stat_bin         geom_bar (vertical bars)
stat_identity    geom_point (dots)
  • Hence, these produce the same output:
    ggplot(test_data, aes(x = duration)) + stat_bin(binwidth = 1)
    ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 1)
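A small check of this equivalence, sketched with the mpg data set from ggplot2 (used here only because it ships with the package), compares the computed bin counts of the two forms. The last plot overrides the default geometry of stat_bin to show that the pairing is a default rather than a requirement.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = hwy))

p_stat <- base + stat_bin(binwidth = 1)        # statistic with its default geom
p_geom <- base + geom_histogram(binwidth = 1)  # geom with its default statistic

# Both layers compute the same bins
same_counts <- identical(layer_data(p_stat)$count, layer_data(p_geom)$count)
same_counts

# Same statistic, different geometry: bin counts drawn as points
p_points <- base + stat_bin(binwidth = 1, geom = "point")
```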

Data versus statistics

  • Be sure you understand: “Does this layer contain data or statistics?”
  • When in doubt, prefer data to statistics
  • Example: Scatter plot conveys more information than a box plot
ggplot(test_data, aes(x = round,
  y = duration)) + geom_point()

ggplot(test_data, aes(x = round,
  y = duration)) + geom_boxplot()

Aesthetics

  • Each geometry interacts with one or more aesthetics
Item              Required           Optional
geom_point        x, y               alpha, colour, fill, shape, size, stroke
geom_line         x, y               alpha, colour, linetype, size
geom_pointrange   x, y, ymax, ymin   alpha, colour, linetype, size
  • You can either map data to an aesthetic, or set it explicitly
my_plot + geom_point(
  mapping = aes(colour = round))

my_plot + geom_point(
  colour="red")
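The difference between mapping and setting is easy to get wrong, so here is a minimal sketch using the mpg data set from ggplot2 (an illustrative choice). The third plot shows a common slip: a constant placed inside aes() is treated as data with a single value, so the points are coloured by the default palette rather than drawn in red.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

p_mapped <- base + geom_point(aes(colour = drv))    # mapped: colour varies with the data
p_set    <- base + geom_point(colour = "red")       # set: every point is red
p_slip   <- base + geom_point(aes(colour = "red"))  # slip: "red" is treated as a data value

# Colours actually used by each layer
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
unique(layer_data(p_slip)$colour)
```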

Position

  • Each layer also has a position specification
  • The default is again identity, meaning “don’t do anything special”
  • Examples: bars can be positioned with stack or dodge
g <- ggplot(test_data, aes(x = num.correct, fill = round))
g + stat_bin(binwidth = 4,
             position = 'stack')

g + stat_bin(binwidth = 4,
             position = 'dodge')
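To see what a position adjustment changes, the sketch below compares the built layer data for the two plots. It assumes, as a stand-in for test_data, that num.correct comes from the x1 to x4 columns of R's built-in anscombe data. The counts are the same in both cases; only the placement of the bars differs.

```r
library(ggplot2)

# Assumption: num.correct values taken from datasets::anscombe (x1..x4)
test_data <- data.frame(
  round       = factor(rep(1:4, each = 11)),
  num.correct = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE)
)

g <- ggplot(test_data, aes(x = num.correct, fill = round))

stacked <- layer_data(g + stat_bin(binwidth = 4, position = "stack"))
dodged  <- layer_data(g + stat_bin(binwidth = 4, position = "dodge"))

# stack shifts bars vertically (ymin/ymax); dodge shifts them sideways (xmin/xmax)
head(stacked[, c("group", "count", "xmin", "xmax", "ymin", "ymax")])
head(dodged[,  c("group", "count", "xmin", "xmax", "ymin", "ymax")])
```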

Practice with layers (Tasks 1–4)

  • Work with a neighbor
  • First discuss the task, then one of you does the typing (take turns for each task)
  • Discuss what you are doing as you write code
  • Write your code in an empty File > New File… > R Script and execute each line using Cmd-Enter (Mac) or Control-Enter (Windows)
  • Use the data set called mpg which is included in the ggplot2 package
  • Exercises are on Canvas

Data

library(ggplot2) # or: library(tidyverse)
?mpg
Fuel economy data from 1999 and 2008 for 38 popular models of car

Description:
     This dataset contains a subset of the fuel economy data that the
     EPA makes available on http://fueleconomy.gov. It contains
     only models which had a new release every year between 1999 and
     2008 - this was used as a proxy for the popularity of the car.

Usage:
     mpg
     
Format:
     A data frame with 234 rows and 11 variables

     manufacturer
     model         model name
     displ         engine displacement, in litres
     year          year of manufacture
     cyl           number of cylinders
     trans         type of transmission
     drv           f = front-wheel drive, r = rear wheel drive, 4 = 4wd
     cty           city miles per gallon
     hwy           highway miles per gallon
     fl            fuel type
     class         "type" of car
mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Task 0 (Example)

  • Create a plot with 1 layer:
    • x mapped to cty
    • y mapped to hwy
    • point geometry
    • identity stat
    • identity position
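One way to write this specification, as a sketch with every part of the layer spelled out, is shown below. The second version relies on the defaults of geom_point and should give the same plot.

```r
library(ggplot2)

# Every part of the layer stated explicitly
task0 <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(stat = "identity", position = "identity")

# Shorthand: identity is already the default stat and position for geom_point
task0_short <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# print(task0) draws the plot
```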

Do Tasks 1–4

Facets and discrete groups

  • Two main options when comparing subsets of data
    • Each discrete set is given a different colour, shape, or size
    • Each discrete set is plotted in its own facet
g <- ggplot(mpg, aes(x = displ, y = hwy))
g + geom_point(aes(colour = drv))

g + geom_point() + facet_wrap(~drv)

Groups

  • When you map discrete variables to colour, shape, or size, ggplot2 automatically maps those variables to group
  • The group aesthetic controls how collections of items are rendered
    • In geom_line the group aesthetic determines which points will be connected by a continuous line
    • In stat_summary the group aesthetic determines which points are summarised by a common statistic
  • If a variable v is continuous but you want to use it for grouping, either specify group = v or transform it into a discrete variable, e.g., colour = factor(v)
ggplot(mpg, aes(x = displ, y = hwy,
              colour=cyl)) +
  geom_point() + geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy,
              colour=factor(cyl))) +
  geom_point() + geom_smooth()

  • To override the automatic grouping, specify aes(group=1) when creating a layer
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
    geom_point() + geom_smooth(aes(group = 1))
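The effect of group on geom_line can be sketched with repeated-measures data. The example assumes a stand-in for test_data built from R's built-in anscombe data (y1 to y4 as duration, one row per respondent and round). Without an explicit group, the discrete x variable defines the groups, so points are connected within each round; with group = respondent, each respondent gets one line across rounds.

```r
library(ggplot2)

# Assumption: stand-in for test_data built from datasets::anscombe
test_data <- data.frame(
  round      = factor(rep(1:4, each = 11)),
  respondent = factor(rep(1:11, times = 4)),
  duration   = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# Default grouping: one group per round
p_default <- ggplot(test_data, aes(x = round, y = duration)) +
  geom_line()

# Explicit grouping: one line per respondent
p_grouped <- ggplot(test_data, aes(x = round, y = duration, group = respondent)) +
  geom_line()

n_default <- length(unique(layer_data(p_default)$group))
n_grouped <- length(unique(layer_data(p_grouped)$group))
```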

Scales

  • Scales apply to the entire plot, i.e., to every layer
  • ggplot2 can detect what type of scale you might want, but it isn’t perfect
  • For example, you might want a logarithmic scale instead of the default linear scale
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
    scale_y_log10(breaks = c(15, 30, 45))
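A scale transformation is applied to the data before the layer is drawn, which the following sketch checks by comparing the layer's built y values with log10(hwy). The breaks are still given in the original units.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
  scale_y_log10(breaks = c(15, 30, 45))

# y values as stored in the layer after the scale has been applied
built_y <- layer_data(p)$y
on_log_scale <- isTRUE(all.equal(built_y, log10(mpg$hwy)))
on_log_scale
```

One consequence is that statistics such as geom_smooth are fitted to the transformed values, not the original ones.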

Labels

  • Always annotate graphs with a title and human-readable labels for each aesthetic
    • x- and y-axes
    • Legends and colour bars
ggplot(mpg, aes(x = displ,
                y = hwy,
                colour = drv)) +
 geom_point() +
 labs(x = "Displacement (litres)",
      y = "Highway miles per gallon",
      colour = "Drive train",
      title = "Automobile features")

Relabelling

library(dplyr) # provides %>%, mutate() and case_when()
mpg2 <- mpg %>%
  mutate(drv2 = case_when(drv == 'f' ~ 'Front',
                          drv == '4' ~ '4WD',
                          drv == 'r' ~ 'Rear'))
ggplot(mpg2, aes(x = displ, y = hwy, colour = drv2)) + geom_point() +
  labs(colour = "Drive train")

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
  facet_wrap(~ drv, labeller = as_labeller(c('f' = 'Front',
                                             'r' = 'Rear',
                                             '4' = '4WD')))

  • Another alternative is to use the forcats package to relabel/reorder factors
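A minimal sketch of the forcats route, assuming the forcats package is installed: fct_recode() changes the labels and fct_relevel() sets their order, which also determines the order in the legend. The names mpg3 and drv2 are chosen here for illustration.

```r
library(ggplot2)
library(forcats)

mpg3 <- mpg
# Relabel: new label on the left, existing value on the right
mpg3$drv2 <- fct_recode(mpg3$drv, Front = "f", Rear = "r", `4WD` = "4")
# Reorder the levels explicitly
mpg3$drv2 <- fct_relevel(mpg3$drv2, "Front", "Rear", "4WD")

p <- ggplot(mpg3, aes(x = displ, y = hwy, colour = drv2)) + geom_point() +
  labs(colour = "Drive train")

levels(mpg3$drv2)
```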

Task 5

More reading